From: PHL Geo Inc.
To: Philadelphia City Council
Subject: Introduction of a new development monitoring tool
Dear City Council Members,
We are PHL Geo Inc., a (fictional) Philadelphia-based, civil-service-oriented coding organization using the power of geospatial analysis and city-owned data to provide informed insights to decision makers such as yourself. We are reaching out to introduce an exciting new tool that can be used in your districts immediately. The tool, called GENTRISK, uses several years of permit and construction information to predict where development may happen in your community in the near future. GENTRISK may be used in your district to manage the negative impacts of development, including trash dumping, street closures, and noise complaints. Getting ahead of development impacts can improve coordination and communication with residents in your district, improve the likelihood of successful permit applications, and enable smart zoning and growth in your city council region.
Development is a process determined by numerous factors and can have positive effects; however, development-induced gentrification, which is “a process of neighborhood change that includes economic change in a historically disinvested neighborhood - by means of real estate investment and new higher-income residents moving in - as well as demographic change - not only in terms of income level, but also in terms of changes in the education level or racial make-up of residents,” can have negative effects (Demsas, 2021). As city council members looking to promote economic development while ensuring that residents have the protections they need, GENTRISK can provide tools that will help you see where development pressures are happening and engage in proactive outreach to those communities. Aligning city services, such as increasing access to Philadelphia’s Homestead Exemption or the PA Homeowner Assistance Fund, can reduce these negative impacts.
GENTRISK combines the permit information described above with external factors, including distance to schools, tree density, and distance to public transit, in order to understand what is driving development in Philadelphia. GENTRISK then uses the relationships between those factors to predict where development may occur so you can get ahead of any potential negative impacts. These predictions form the basis of several tools available in GENTRISK, such as scenario planning. In the scenario planning framework, you may modify inputs to see how they will impact development going forward. For example, if the new Philly Tree Plan is likely to bring more tree canopy to your community, you can use GENTRISK to model how much further development is likely to occur.
The following document provides a technical overview of how GENTRISK functions, how to interpret its results, and how it can be incorporated into your city planning work. We look forward to seeing GENTRISK integrated into your planning toolkit.
For questions, comments, concerns, or to learn more, please email phlgeo@gmail.com (fake email!).
Kindly,
Akira Di Sandro and Benjamin Myers
PHL Geo Inc. Founders
The GENTRISK tool uses the construction of new housing and commercial buildings as its core output, as construction provides both opportunities (in the form of economic development and an increased tax base) and pressures (in the form of gentrification, noise, and trash) for residents and neighborhoods. Construction information is available from the Department of Licenses and Inspections for a time period stretching over the last 10 years. While neighborhoods are constantly in a state of change, focusing on a more specific time period can provide City Council with an accurate understanding of what is driving development in the city. The period from 2015 to 2018 provides a good baseline for understanding development drivers: it predates the impacts that the COVID-19 global pandemic had on development, yet is recent enough that certain key drivers - such as the turnover of permanent residents for renters - are still prevalent within Philadelphia.
# philly permit data
dat_permit <- st_read("https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+permits&filename=permits&format=geojson&skipfields=cartodb_id")
# only keep new construction permits
newcon_permits <- dat_permit %>%
filter(grepl("NEW CON|NEWCON",typeofwork)) %>%
mutate(year = substr(permitissuedate, 1,4)) %>%
filter(year %in% c(2013:2019,2022)) %>%
st_transform(crs = 2272)
rm(dat_permit) # remove this big file because we won't use this anymore
city_council <- st_read("https://opendata.arcgis.com/api/v3/datasets/1ba5a5d68f4a4c75806e78b1d9245924_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1")
city_council <- city_council %>%
mutate(DISTRICT = factor(DISTRICT, levels = 1:10))
# census data
# variables of interest:
# B19013_001: medHHincome
# B01001_001: total pop
# B02001_002: white pop
# B25058_001: median rent
# B15003_022: attainment of bachelor's of population 25+
# B25071_001: Median gross rent as a percentage of household income (past year)
# B07201_001: Geographical Mobility in the Past Year for Current Residence--Metropolitan Statistical Area Level in the United States, estimated total
# B07201_002: Geographical Mobility in the Past Year for Current Residence--Metropolitan Statistical Area Level in the United States, same house 1 year ago
# B25008_002: total population in occupied housing by tenure, owner occupied
# B25008_003: total population in occupied housing by tenure, renter occupied
# B99082_001: allocation of private vehicle occupancy
# B25044_001: tenure by vehicles available
# B09002_001: own children under 18
# B11004_001: related children under 18
census_vars <- paste0(c("B01001_001", "B19013_001", "B02001_002", "B25058_001", "B15003_022", "B25071_001", "B07201_001",
"B07201_002", "B25008_002", "B25008_003", "B99082_001", "B25044_001", "B09002_001",
"B11004_001"),"E") # census variables of interest that are available
medHHinc2019 <- 70459 # source: https://www.deptofnumbers.com/income/pennsylvania/philadelphia/
tracts19 <-
get_acs(geography = "block group",
variables = census_vars,
year = 2019, state = 42,
geometry = T, output = "wide") %>%
st_transform(crs = 2272) %>%
dplyr::select(!matches("M$")) %>%
rename(total_pop = B01001_001E,
med_hh_inc = B19013_001E,
white_pop = B02001_002E,
med_rent = B25058_001E,
bachelors25 = B15003_022E,
pct_rent_hhinc = B25071_001E,
mobility_tot_metro = B07201_001E,
samehouse1yr_metro = B07201_002E,
owner_occ = B25008_002E,
renter_occ = B25008_003E,
private_vehicle_occ = B99082_001E,
vehicles_avail = B25044_001E) %>%
mutate(children = B09002_001E + B11004_001E,
year = "2019") %>%
dplyr::select(!matches("^B|white_pop",ignore.case = F)) %>%
st_centroid()
# philly neighborhood data
nhoods_path <- 'https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson'
nhoods <- st_read(nhoods_path, quiet = T) %>%
st_transform(crs = 2272) %>%
dplyr::select(mapname)
# philly bounds
philly <- st_read("https://opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson") %>%
st_transform(crs = 2272) %>%
dplyr::select(OBJECTID,geometry)
# limit permit dat to philly border
newcon_permits <- newcon_permits %>%
.[philly,]
# 311 incident reports data
carto_url = "https://phl.carto.com/api/v2/sql"
# 311 public cases table
table_name_311 = "public_cases_fc"
# query
where_311 = "closed_datetime >= '2010-01-01' AND closed_datetime < '2020-01-01' AND service_name IN ('Rubbish/Recyclable Material Collection','Illegal Dumping','Construction Complaints', 'Building Construction', 'Sanitation Violation', 'Street Trees', 'Dangerous Sidewalk', 'Homeless Encampment Request', 'Parks and Rec Safety and Maintenance', 'Vacant House or Commercial')"
query311 = paste("SELECT *",
"FROM", table_name_311,
"WHERE", where_311)
reports311 = rphl::get_carto(query311, format = "csv", base_url = carto_url, stringsAsFactors = F)
reports311 <- reports311 %>%
  mutate(Year = year(as.Date(closed_datetime)))
reports311 <- reports311 %>%
filter(!is.na(lon) & !is.na(lat)) %>%
st_as_sf(coords = c("lon","lat"), crs = 4326) %>%
st_transform(crs = 2272) %>%
mutate(
type = str_replace_all(service_name, "[^[:alnum:]]", ""),
Legend = "311") %>%
dplyr::select(Legend, type, geometry, Year)
# our CRS (EPSG:2272) is in feet, so convert the 500m x 500m cell size to feet
cell_size <- 500 * 3.28084
fishnet <-
st_make_grid(philly,
cellsize = cell_size,
square = TRUE,
crs = 2272) %>%
.[philly] %>%
st_sf() %>%
mutate(uniqueID = 1:n())
# permit-count fishnet for each year, 2013-2019 (permit_net13 keeps the raw
# count_permits column so its geometry and extra columns can be reused later)
for (yr in 2013:2019) {
  yy <- substr(yr, 3, 4)
  net <- dplyr::select(newcon_permits %>% filter(year == as.character(yr))) %>%
    mutate(count_permits = 1) %>%
    aggregate(., fishnet, sum) %>%
    mutate(!!paste0("count_permits", yy) := replace_na(count_permits, 0),
           uniqueID = as.numeric(rownames(.)),
           cvID = sample(round(nrow(fishnet) / 16), size = nrow(fishnet), replace = TRUE))
  if (yr != 2013) net <- dplyr::select(net, -count_permits)
  assign(paste0("permit_net", yy), net)
}
# combine permit counts from all years
permit_net_allyrs <- permit_net13 %>%
st_drop_geometry() %>%
dplyr::select(uniqueID,cvID,count_permits13) %>%
left_join(permit_net14 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits14), by = "uniqueID") %>%
left_join(permit_net15 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits15), by = "uniqueID") %>%
left_join(permit_net16 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits16), by = "uniqueID") %>%
left_join(permit_net17 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits17), by = "uniqueID") %>%
left_join(permit_net18 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits18), by = "uniqueID") %>%
left_join(permit_net19 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits19), by = "uniqueID") %>%
left_join(permit_net13 %>% dplyr::select(-c(cvID,count_permits13,count_permits)), by = "uniqueID")
The model powering GENTRISK requires that all of the variables are gathered in the same format and then compared across time. To facilitate this process, the PHL GEO Inc. team used the following code to gather all of the variables and put them into a common format defined by a 500m x 500m grid (see figure 1 for an example of a fishnet grid). These variables are summarized at the grid level and can then provide an understanding of how neighborhoods are changing over time, followed by an investigation of what could be driving those changes by asking the core question: “What are the relationships between indicators and development (permits) over the last several years, what are the strengths of those relationships, and how can they be used to predict development in the future?”
ggplot() +
geom_sf(data = permit_net19, aes(fill = count_permits19)) +
scale_fill_viridis() +
labs(title = "Permit Count in each grid cell",
fill = "2019",
caption = "Figure 1.") +
plotTheme(title_size = 14) + theme(legend.position = "bottom")
A key way to contextualize these changes is to look at the changes happening within city council districts throughout Philadelphia. As shown in figure 2 below, there are 10 city council districts in Philadelphia where changes and development pressures will have to be accounted for. We will be focusing on the fishnet- and neighborhood-level measures, as city council districts are much larger, smoothing over small but important details.
# plot city council districts
ggplot() +
geom_sf(data=city_council, aes(fill=DISTRICT)) +
scale_fill_viridis(discrete = T) +
labs(fill = "City Council District",
title = "City Council Districts Within Philadelphia",
caption = "Figure 2.") +
mapTheme()
Development patterns may be spread over space, such that they form a smooth ‘mosaic’ across the city of Philadelphia. Alternatively, several permits may be clustered together, either because they were all submitted as part of the same construction project (e.g., an electrical permit and a roofing permit for the same property) or because several units were being worked on in tandem. To better understand whether Philadelphia is experiencing a ‘mosaic’ of development or whether development has been clustered into particular areas, a fishnet of 500 meter by 500 meter grid cells was established across the city. The 500 meter scale (~1/3 mile) allows for a fluid understanding of development patterns at the block scale.
Figure 3 below shows the number of permits granted for each fishnet square in 2013, 2015, 2017, and the target year 2019. As is visible in the maps, trends and patterns of development shift across the city: in 2013 development centered around the Center City neighborhood, while by 2019 - when a record number of permits for the period in focus was granted - development was centered in Center City as well as North Philadelphia.
It is worthwhile to note that while there are a number of areas with extensive permit counts and development pressures, there are also a significant number of fishnet cells in the city that did not experience much - if any - development. These low-no development areas, from the perspective of new construction, could have an outsized influence on the model, reducing its accuracy in areas experiencing higher levels of development. This was factored into the model through the use of a Poisson distribution, described in greater detail below.
# map of permit count over time
permit_net_formap <- permit_net_allyrs %>%
dplyr::select(-c(uniqueID,cvID,count_permits14,count_permits16,count_permits18)) %>%
rename(`Permit Count 2013` = count_permits13,
`Permit Count 2015` = count_permits15,
`Permit Count 2017` = count_permits17,
`Permit Count 2019` = count_permits19) %>%
gather("count_permits", "value", -geometry) %>%
st_as_sf()
permit_count_map <- ggplot(data = permit_net_formap) +
geom_sf(aes(fill = value)) +
scale_fill_viridis() +
facet_wrap(~count_permits) +
labs(title = "Count of New Construction Permits for the fishnet",
caption = "Figure 3.") +
mapTheme(title_size = 14) + theme(legend.position="bottom")
permit_count_map
There are a number of different methods to account for the impacts of development, all of which provide insight into its social (noise, street closures, dumping), environmental (rubbish collection, street tree impairment), and economic factors. GENTRISK takes into account several census features, described below along with the reasoning behind them, as well as variables from Open Data Philly, the region’s open data catalog, to incorporate these factors into a single model.
The census variables include the following:
# function to make fishnet from acs dataframe (dat_tract), for variable (var_name), giving it the legend (var_name), aggregating with function (sum_or_mean)
# for example, to make a fishnet summing all total_pop variables for 2019, one would define the variables as follows:
# dat_tract <- tracts19; var_name <- "total_pop"; sum_or_mean <- sum
makenet <- function(dat_tract, var_name, sum_or_mean){
net_tract <- dat_tract %>%
.[philly,] %>%
dplyr::select(matches(var_name)) %>%
mutate(Legend = var_name) %>%
st_join(fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = sum_or_mean(get(var_name), na.rm = T)) %>%
left_join(fishnet, ., by = "uniqueID") %>% # add geometry back in
spread(Legend, count, fill=0) %>% # fill in ones where fishnet was missing, count was NA with 0
dplyr::select(-`<NA>`) %>%
ungroup()
return(net_tract)
}
var_names <- names(tracts19)[3:14]
# for loop to create fish nets for all ACS variables for all years
for (i in 1:length(var_names)){
# select year, variable, and legend name
var_name <- var_names[i]
if (i %in% c(2:4,18)){
sum_or_mean <- mean
} else {
sum_or_mean <- sum
}
# run function to make fishnet
net_tract <- makenet(tracts19,var_name,sum_or_mean)
# save as individual fishnet, uncomment if needed
# net_tract_name <- paste0("net_",var_name,yr)
# assign(net_tract_name, net_tract)
# save all variables into one big net
if (i == 1) {
census_net <- net_tract %>%
st_drop_geometry()
} else {
net_tract_tojoin <- net_tract %>%
st_drop_geometry() %>%
dplyr::select(1:2)
census_net <- left_join(census_net, net_tract_tojoin, by = "uniqueID")
}
}
In addition to census information, GENTRISK takes into account other factors and amenities that could increase the likelihood of development and subsequent pressures. The incorporation of these factors builds on a geospatial risk model, in which an increase in the following factors would correspond to an increase in the number of permits granted. A resident could use this to ask, “Given certain characteristics about my neighborhood, what is the likelihood, and the amount, of permit granting that I can expect to see?”
A primary challenge of using data in these models is that it is available across multiple scales and times; to counteract this challenge, the GENTRISK model focuses on the distribution of the following factors from 2015 to 2018, and then uses this information to validate its results on 2019. Information was gathered from 311 calls made to the city, particularly as they relate to new construction impacts. As 311 calls are openly accessible to all, they represent data that is more flexible and inclusive than data collected solely by the city.
The 311 variables that were gathered from Open Data Philly include:
# yearly long-format 311 tables, 2015-2019 (long_15 ... long_19)
for (yr in 2015:2019) {
  assign(paste0("long_", substr(yr, 3, 4)),
         reports311 %>%
           filter(Year == yr) %>%
           arrange(type) %>%
           mutate(Legend = paste0(type, Year), .keep = "unused"))
}
vars311_net <- rbind(long_15, long_16, long_17, long_18, long_19) %>%
st_join(fishnet, join = st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
left_join(fishnet, ., by = "uniqueID") %>% # add geometry back in
spread(Legend, count, fill=0) %>% # fill in ones where fishnet was missing, count was NA with 0
dplyr::select(-`<NA>`) %>%
ungroup()
# create average nearest neighbor distance features for each 311 type and year
# convenience to reduce length of function names.
st_c <- st_coordinates
st_coid <- st_centroid
vars311_net_ccoid <- st_c(st_coid(vars311_net))
# add nearest neighbor features: average distance to the 3 nearest 311 reports
# of each type, per year; short column prefixes map to the full 311 Legend names
legend_map <- c(buildingCon  = "BuildingConstruction",
                dangerSide   = "DangerousSidewalk",
                illegalDump  = "IllegalDumping",
                parksNrec    = "ParksandRecSafetyandMaintenance",
                materialColl = "RubbishRecyclableMaterialCollection",
                streetTrees  = "StreetTrees",
                vacant       = "VacantHouseorCommercial")
for (yr in 15:19) {
  long_yr <- get(paste0("long_", yr))
  for (nm in names(legend_map)) {
    vars311_net[[paste0(nm, yr, ".dist")]] <-
      nn_function(vars311_net_ccoid,
                  st_c(long_yr %>% filter(Legend == paste0(legend_map[[nm]], "20", yr))), 3)
  }
}
# join census vars fishnet to permit count net + 311 net
all_net <- left_join(permit_net_allyrs, census_net, by = "uniqueID") %>%
left_join(vars311_net %>% st_drop_geometry(), by = "uniqueID") %>%
st_as_sf()
All census variables were pulled at the block group level, which is a smaller scale than the 500x500 meter fishnet used in this analysis. As there is no perfect way to join block-group-level census data to our fishnet, we joined the centroid of each block group to the fishnet and aggregated data where necessary. As an example, if a fishnet cell encompassed census block groups A, B, and C, with respective populations of 100, 150, and 125, then the total fishnet population would be 375. This enables a smooth comparison across the surface of Philadelphia. In some cases, where the centroid of the block group falls outside the boundary of the fishnet square, the resulting fishnet square population would be 0. While this introduces a certain level of error into the calculations, the error can be quantified and corrective actions and more informed decisions can be made.
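The centroid-join logic above can be sketched on a toy example (the three block groups and their populations are hypothetical; `makenet()` in the code above does the real work on city data):

```r
# toy sketch of the centroid-join aggregation (hypothetical data, not city data)
library(sf)
library(dplyr)

# one fishnet cell covering the square [0,2] x [0,2]
cell <- st_sf(uniqueID = 1, geometry = st_sfc(st_polygon(list(
  rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0))))))

# three "block group" centroids: A and B inside the cell, C outside
blocks <- st_sf(total_pop = c(100, 150, 125),
                geometry = st_sfc(st_point(c(0.5, 0.5)),
                                  st_point(c(1.5, 1.5)),
                                  st_point(c(3, 3))))

# sum population over centroids that fall within the cell
st_join(blocks, cell, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(total_pop = sum(total_pop))
# the cell containing A and B sums to 250; C's centroid falls outside the cell
```

If all three centroids fell inside one cell, the cell total would be 375, matching the worked example above.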
With all of the potential variables to incorporate, it is important to evaluate which factors have the greatest potential influence on the prediction of the permit count (the proxy for development), which can be done through the correlation matrix below. In the matrix, boxes with a deep red color indicate a positive relationship: as the value increases, the predicted permit count increases. Boxes with a deep blue color indicate that as the variable increases in value, the predicted permit count decreases. We established our model to use information from 2015-2018 and test our predictions on 2019, the last pre-COVID year. Using these relationships, we can establish clear linkages between 2019 predictions and future predictions.
From the matrix below, we can see that the previous year’s permit count (i.e., the development that occurred in 2018) emerged as the single most important factor in predicting the count of future permits (in this case, modeling the development that would occur in 2019).
The strength of this relationship is 0.9, such that a 1% increase in the number of permits granted for new construction in the prior year corresponds to a 0.9% increase in the count of permits for the year being predicted. Earlier years also show a relationship: a 1% increase in the number of permits granted three years prior (in 2016) explains about a 0.6% increase in the number of permits in the target year.
Non-temporal factors with a high influence include 311 Building Construction complaints (a 1% increase corresponds to a 0.6% increase in permit count), Illegal Dumping complaints (a 1% increase corresponds to a 0.5% increase), and Rubbish/Recyclable Material Collection complaints (a 1% increase explains a 0.4% increase).
# keep only the variables that look like good predictors for the model
var_for_model <- c("uniqueID","cvID","count_permits19","count_permits18","count_permits17","count_permits16",
"count_permits15","BuildingConstruction2019","IllegalDumping2019","DangerousSidewalk2019",
"RubbishRecyclableMaterialCollection2019","StreetTrees2019","parksNrec19.dist",
"VacantHouseorCommercial2019","vehicles_avail","med_rent","renter_occ","total_pop",
"samehouse1yr_metro","children","bachelors25")
all_net19 <- all_net %>%
dplyr::select(all_of(var_for_model))
# only keep years for count_permit columns
names(all_net19) <- c(str_replace_all(var_for_model, c("2019" = "","19." = ".")), "geometry")
for_cormat19 <- all_net19 %>%
st_drop_geometry() %>%
dplyr::select(-c(uniqueID,cvID))
ggcorrplot(
round(cor(for_cormat19), 1),
p.mat = cor_pmat(for_cormat19),
colors = c("#4b2875", "white", "#9c1339"),
type="lower",
insig = "blank",
digits = 4,
lab = T, lab_size = 2)+
labs(caption = "Figure 4")
Recall that our goal is to predict the amount of development that will occur in the future, using permits granted for new construction as a proxy. To assess future construction, the model gathers the permit counts from previous years as well as the variables described above, and then predicts the permit count in 2019. However, as shown in figure 5 below, the majority of fishnet squares do not see permit applications in a given year. To evaluate the true relationships between the variables, it is important to account for the influence of these low-no values through the use of a Poisson regression. While low-no values may also be due to errors introduced when converting census block groups to fishnet squares, the majority are due to an absence of permit applications. A Poisson regression is well suited to predicting counts as the outcome.
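A minimal sketch of such a count model uses `glm()` with a Poisson family on the fishnet data assembled above; the predictor list here is illustrative, not the final specification:

```r
# illustrative Poisson count model on the renamed fishnet data (all_net19);
# the predictors shown are a sketch of the candidate set, not the full model
permit_poisson <- glm(
  count_permits19 ~ count_permits18 + count_permits17 +
    BuildingConstruction + IllegalDumping + RubbishRecyclableMaterialCollection,
  family = "poisson",
  data = st_drop_geometry(all_net19))
summary(permit_poisson)
```

The Poisson family models the log of the expected count as a linear function of the predictors, which is why it handles the many zero- and low-count cells more gracefully than an ordinary least squares fit.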
all_net19 %>% ggplot(aes(x = count_permits19)) +
geom_histogram(bins = 66, fill = viridis::cividis(1)) +
theme(text = element_text( color = "black"),
plot.title = element_text(size = 14, colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)) +
labs(title = "Permit Distribution (2019)",
x = "Count of New Construction Permits", y = "Count",
caption = "Figure 5.")
In addition to controlling for the influence of low-no permit count fishnets, it is important to account for any spatial variability in the model, so that the clusters of permit counts described earlier do not spatially correlate with other variables. If there is spatial autocorrelation between the two, certain variables could have an outsized influence over others and skew the results. Figure 6 demonstrates certain spatial trends, such as the fact that building construction and rubbish 311 complaints are most concentrated in Southeast Philadelphia (City Council District 2), while the Center City area has the highest concentration of renter-occupied housing as well as individuals holding a bachelor’s degree.
# add neighborhood names
allnet19_formodel <- all_net19 %>%
st_centroid() %>%
st_join(dplyr::select(nhoods, mapname)) %>%
st_drop_geometry() %>%
left_join(dplyr::select(all_net19, geometry, uniqueID), by = "uniqueID") %>%
st_sf()
vars_net.long <- gather(allnet19_formodel %>% dplyr::select(-count_permits19),
variable, value, -geometry, -uniqueID, -cvID, -mapname)
vars <- unique(vars_net.long$variable)
varList <- list()
for (i in vars) {
varList[[i]] <- ggplot() +
geom_sf(data = filter(vars_net.long, variable == i),aes(fill = value), colour=NA) +
scale_fill_viridis(name = "") +
labs(title = i) +
theme(text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.text.x = element_text(size = 7),
legend.position = "bottom",
legend.text = element_text(size = 5))
}
do.call(grid.arrange,c(varList, ncol = 3, top = "Predictors of Permit Count (on fishnet)", bottom = "Figure 6."))
A Local Moran's I test was run to establish whether the new construction permits were randomly distributed through space or clustered together in a non-random fashion. We defined any cell with p < 0.001 as a statistically significant hot spot; these emerged in fishnet cells slightly north of, within, and south of the Center City area (Figure 7).
final_net.nb <- poly2nb(as_Spatial(allnet19_formodel), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE) # turn neighborhood weights into list of weights
local_morans <- localmoran(allnet19_formodel$count_permits19, final_net.weights, zero.policy=TRUE) %>%
as.data.frame() # Ii moran's I at ith cell, Ei expected/mean from neighbors
# join local Moran's I results to fishnet
final_net.localMorans <-
cbind(local_morans, as.data.frame(allnet19_formodel)) %>%
st_sf() %>%
dplyr::select(`Permit Count` = count_permits19,
`Local Morans I` = Ii,
`P Value` = `Pr(z != E(Ii))`) %>%
mutate(`Significant Hotspots` = ifelse(`P Value` <= 0.001, 1, 0)) %>%
gather(variable, value, -geometry)
# now plot
vars <- unique(final_net.localMorans$variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, variable == i),aes(fill = value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme(title_size = 14) + theme(legend.position="bottom")
}
do.call(grid.arrange,c(varList, ncol = 2, top = "Local Moran's I Statistics for Permit Count in Philadelphia",
bottom = "Figure 7."))
final_net <-
allnet19_formodel %>%
mutate(permitct.isSig =
ifelse(localmoran(allnet19_formodel$count_permits19,
final_net.weights, zero.policy = TRUE)[,5] <= 0.001, 1, 0)) %>%
mutate(permitct.isSig.dist =
nn_function(st_coordinates(st_centroid(allnet19_formodel)),
st_coordinates(st_centroid(
filter(allnet19_formodel, permitct.isSig == 1))), 1))
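The nn_function() helper used above is defined elsewhere in our scripts. As a sketch of what it computes (the average Euclidean distance from each point to its k nearest neighbors in a reference set), a minimal base-R equivalent might look like the following; this is an illustrative reimplementation, not the exact production helper, which uses FNN::get.knnx for speed:

```r
# Illustrative sketch of nn_function(): for each row of measureFrom (an
# x/y coordinate matrix), return the mean Euclidean distance to its k
# nearest points in measureTo.
nn_function <- function(measureFrom, measureTo, k) {
  apply(measureFrom, 1, function(pt) {
    d <- sqrt((measureTo[, 1] - pt[1])^2 + (measureTo[, 2] - pt[2])^2)
    mean(sort(d)[seq_len(k)])
  })
}
```

So permitct.isSig.dist is, for each fishnet cell centroid, the distance to the single (k = 1) nearest significant hot-spot centroid.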
The scatterplots in Figure 8 demonstrate significant relationships for certain predictors, the strongest being the previous year's permit count (count_permits18, r = 0.86), followed by the count from two years prior (count_permits17, r = 0.76). Several features, such as children, samehouse1yr_metro, and 311 complaints for street trees in the year being predicted, show sharp 'peaks' of values. These peaks indicate substantial variation in those features, which may make them more susceptible to outliers.
correlation.long <-
st_drop_geometry(final_net) %>%
dplyr::select(-uniqueID, -cvID, -mapname) %>%
gather(variable, value, -count_permits19)
correlation.cor <-
correlation.long %>%
group_by(variable) %>%
summarize(correlation = cor(value, count_permits19, use = "complete.obs"))
ordered_vars <- c("count_permits18","count_permits17","count_permits16","count_permits15","total_pop","bachelors25",
"children","med_rent","renter_occ","samehouse1yr_metro","vehicles_avail","BuildingConstruction",
"IllegalDumping","DangerousSidewalk","StreetTrees","RubbishRecyclableMaterialCollection",
"VacantHouseorCommercial","parksNrec.dist","permitct.isSig","permitct.isSig.dist")
ggplot(correlation.long %>%
mutate(variable = factor(variable, levels = ordered_vars)),
aes(value, count_permits19)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor %>%
mutate(variable = factor(variable, levels = ordered_vars)),
aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "black") +
facet_wrap(~variable, ncol = 3, scales = "free") +
labs(title = "Permit count as a function of predictors",
caption = "Figure 8.") +
theme(text = element_text( color = "black"),
plot.title = element_text(size = 14, colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 10))
After iterations of testing, the following combination of predictors gave the best results for our model.
# just risk factors
reg.vars <- c("count_permits18","count_permits17","count_permits16","count_permits15","BuildingConstruction",
"IllegalDumping","DangerousSidewalk","RubbishRecyclableMaterialCollection","vehicles_avail",
"med_rent","renter_occ","VacantHouseorCommercial","total_pop","samehouse1yr_metro",
"StreetTrees","children","parksNrec.dist","bachelors25")
# random k-fold CV
reg.CV <- crossValidate(dataset = final_net,
id = "cvID",
dependentVariable = "count_permits19",
indVariables = reg.vars) %>%
dplyr::select(cvID, count_permits19, Prediction, geometry) %>%
mutate(error = count_permits19 - Prediction)
# MAE
reg.MAE <- mean(abs(reg.CV$error)) # ~7.628154
# with local Moran's I spatial process features
reg.sp.vars <- c(reg.vars,"permitct.isSig","permitct.isSig.dist")
## RUN REGRESSIONS
reg.spatialCV <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "count_permits19",
indVariables = reg.sp.vars) %>%
dplyr::select(cvID, count_permits19, Prediction, geometry) %>%
mutate(error = count_permits19 - Prediction)
# MAE
reg.spatial.MAE <- mean(abs(reg.spatialCV$error)) # ~5.611653
# LOGO-CV
# needed to redefine crossValidate to solve the issue with a neighborhood of NA
# this function also works for the previous lines using crossValidate()
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
# cvID_list <- unique(dataset[[id]])
cvID_list <- as.character(na.omit(unique(dataset[[id]])))
for (i in cvID_list) {
thisFold <- i
cat("This hold out fold is", thisFold, "\n")
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, all_of(indVariables),
all_of(dependentVariable))
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, all_of(indVariables),
all_of(dependentVariable))
form_parts <- paste0(dependentVariable, " ~ ", paste0(indVariables, collapse = "+"))
form <- as.formula(form_parts)
regression <- glm(form, family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
# adding neighborhood for LOGO CV, risk factors only
reg.logoCV <- crossValidate(
dataset = final_net,
id = "mapname",
dependentVariable = "count_permits19",
indVariables = reg.vars) %>%
dplyr::select(cvID = mapname, count_permits19, Prediction, geometry) %>%
mutate(error = count_permits19 - Prediction)
# MAE
reg.logo.MAE <- mean(abs(reg.logoCV$error)) # 8.273702
# risk factors + spatial process
reg.logo.spatialCV <- crossValidate(
dataset = final_net,
id = "mapname",
dependentVariable = "count_permits19",
indVariables = reg.sp.vars) %>%
dplyr::select(cvID = mapname, count_permits19, Prediction, geometry) %>%
mutate(error = count_permits19 - Prediction)
# MAE
reg.logo.spatial.MAE <- mean(abs(reg.logo.spatialCV$error)) # 6.170643
We created two types of models: one with only risk factors as predictors, and another with both risk factors and the spatial process features. To increase accuracy, we kept the model with both risk factors and spatial process, and used the model without spatial predictors as a check against the predicted development. The spatially aware model accounts for the role that spatial clustering plays in predicting future development and the risk that a model could falsely attribute clustered development to other variables. The leave-one-group-out cross-validation method (LOGO-CV) iterates through each neighborhood, predicts the amount of development expected, and compares that prediction against the held-out neighborhood, yielding observed and predicted values. Random k-fold cross-validation was also used (k = 100). As Figure 9 demonstrates, incorporating the spatial factors into our model increased its accuracy. In the LOGO-CV method, excluding the spatial factors (left-hand panels) increased the error (the difference between what was predicted and what was observed), while including them greatly reduced the error, most visibly in South Philadelphia.
reg.summary <- rbind(
mutate(reg.CV,
Error = Prediction - count_permits19,
Regression = "Random k-fold CV: Predictors"),
mutate(reg.spatialCV,
Error = Prediction - count_permits19,
Regression = "Random k-fold CV: Predictors with Spatial Process"),
mutate(reg.logoCV,
Error = Prediction - count_permits19,
Regression = "Spatial LOGO-CV: Predictors"),
mutate(reg.logo.spatialCV,
Error = Prediction - count_permits19,
Regression = "Spatial LOGO-CV: Predictors with Spatial Process")) %>%
st_sf()
error_by_reg_and_fold <-
reg.summary %>%
group_by(Regression, cvID) %>%
summarize(Mean_Error = mean(Prediction - count_permits19, na.rm = T),
MAE = mean(abs(Mean_Error), na.rm = T),
SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
ungroup()
# make map
# adjust mapTheme in order to fit title
mapTheme <- function(base_size = 12, title_size = 16) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = title_size,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.text.x = element_text(size = 7))
}
error_by_reg_and_fold %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
facet_wrap(~Regression) +
scale_fill_viridis() +
labs(title = "Errors by Cross Validation method",
caption = "Figure 9.") +
mapTheme(title_size = 14) + theme(legend.position="bottom")
Figure 10 shows the observed versus predicted counts of new construction permits in 2019. The model tends to over-predict the number of construction permits in the Center City area, yet under-predicts in some northwestern parts of the city. The observed permit count is more evenly distributed than the predicted count, suggesting that the predictions may be thrown off by extreme values.
reg.summary %>%
filter(Regression == "Random k-fold CV: Predictors with Spatial Process") %>%
dplyr::select(-c(cvID,error,Error)) %>%
rename(`Real Permit Count` = count_permits19) %>%
gather("counts","value",-geometry,-Regression) %>%
ggplot() +
geom_sf(aes(fill = value)) +
scale_fill_viridis() +
facet_wrap(~counts) +
labs(title = "Real vs Predicted Counts of Permits (2019)",
caption = "Figure 10.") +
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.text.x = element_text(size = 10),
legend.position = "bottom")
Looking at the distribution of error in histogram form (Figure 11) supports the hypothesis that extreme values are skewing model results, while affirming that accounting for the spatial process is an important part of the analysis. Accounting for the spatial process, particularly in the LOGO-CV method, removes the influence of the extreme MAE value (roughly 350) visible in the far right-hand portion of that graph. However, even after the spatial variables have been accounted for, the spatial LOGO-CV method still shows approximately ten cells with a permit count error greater than 100. This is a significant error margin and may substantially skew the results.
# adjust plotTheme to fit titles
plotTheme <- function(base_size = 12, title_size = 16) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = title_size, colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 10)
)
}
error_by_reg_and_fold %>%
ggplot(aes(MAE)) +
geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
facet_wrap(~Regression) +
geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 450, by = 50)) +
labs(title="Distribution of MAE", subtitle = "k-fold cross validation vs. LOGO-CV",
x="Mean Absolute Error", y="Count",
caption = "Figure 11.") +
plotTheme(title_size = 14) + theme(legend.position="bottom")
Table 1 below summarizes the regressions, including the mean and standard deviation of MAE for both models under both cross-validation methods. The highlighted rows show the results of our preferred model. Including the spatial process features decreased the mean MAE by 1.73 with the random k-fold CV method and by 3.34 with the spatial LOGO-CV method.
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(`Mean MAE` = round(mean(MAE), 2),
`SD MAE` = round(sd(MAE), 2)) %>%
kable(caption = "Table 1: Summary of Regressions") %>%
kable_styling("striped", full_width = F) %>%
row_spec(c(2,4), bold = T, color = "white", background = viridis::cividis(1)) %>%
kable_classic(full_width = F, html_font = "Cambria")
| Regression | Mean MAE | SD MAE |
|---|---|---|
| Random k-fold CV: Predictors | 4.49 | 12.85 |
| Random k-fold CV: Predictors with Spatial Process | 2.76 | 4.07 |
| Spatial LOGO-CV: Predictors | 10.08 | 31.36 |
| Spatial LOGO-CV: Predictors with Spatial Process | 6.74 | 14.61 |
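The improvement from adding the spatial process features can be read directly off Table 1; as a quick check of the arithmetic (values copied from the table above):

```r
# Mean MAE values from Table 1
kfold_predictors   <- 4.49
kfold_with_spatial <- 2.76
logo_predictors    <- 10.08
logo_with_spatial  <- 6.74

# Reduction in mean MAE from adding the spatial process features
kfold_improvement <- kfold_predictors - kfold_with_spatial  # 1.73
logo_improvement  <- logo_predictors - logo_with_spatial    # 3.34
```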
RaceContext <- get_acs(geography = "block group",
variables = c("B01001_001E","B02001_002E"),
year = 2019, state = 42,
geometry = T, output = "wide") %>%
st_transform(crs = 2272) %>%
dplyr::select(!matches("M$")) %>%
rename(total_pop = B01001_001E,
white_pop = B02001_002E) %>%
mutate(pct_white = ifelse(total_pop > 0, white_pop / total_pop,0),
RaceContext = ifelse(pct_white > 0.5, "Majority White",
ifelse(total_pop != 0 , "Majority non-White", NA))) %>%
dplyr::select(-c(NAME,total_pop,white_pop,pct_white)) %>%
.[nhoods,]
race_map <- reg.summary %>%
filter(grepl("LOGO", Regression)) %>%
st_centroid() %>%
st_join(RaceContext) %>%
na.omit() %>%
dplyr::select(-c(count_permits19:GEOID)) %>%
st_drop_geometry() %>%
left_join(nhoods, by = c("cvID" = "mapname")) %>%
st_sf()
ggplot() +
geom_sf(data = race_map, aes(fill = RaceContext), color = "darkgrey") +
scale_fill_viridis(discrete = T) +
labs(fill = "Race Context",
title = "Racial Make-up of Philadelphia Neighborhoods",
subtitle = "2019",
caption = "Figure 12.") +
mapTheme() + theme(legend.position = "bottom")
To assess whether the model generalizes to different neighborhood contexts, we tested it against racial and income contexts. Using 2019 US Census data, we defined a neighborhood as "Majority White" if over 50% of its population was White, and "Majority non-White" otherwise (see Figure 12). We defined a neighborhood as "Above Median Income" if its median household income was above the citywide median income for Philadelphia in 2019, and "Below Median Income" otherwise (see Figure 13).
IncomeContext <- get_acs(geography = "block group",
variables = c("B19013_001E"),
year = 2019, state = 42,
geometry = T, output = "wide") %>%
st_transform(crs = 2272) %>%
dplyr::select(!matches("M$")) %>%
rename(med_hh_inc = B19013_001E) %>%
mutate(IncomeContext = ifelse(med_hh_inc > medHHinc2019, "Above Median Income", "Below Median Income")) %>%
dplyr::select(-c(NAME,med_hh_inc)) %>%
.[nhoods,]
inc_map <- reg.summary %>%
filter(grepl("LOGO", Regression)) %>%
st_centroid() %>%
st_join(IncomeContext) %>%
na.omit() %>%
dplyr::select(-c(count_permits19:GEOID)) %>%
st_drop_geometry() %>%
left_join(nhoods, by = c("cvID" = "mapname")) %>%
st_sf()
ggplot() +
geom_sf(data = inc_map, aes(fill = IncomeContext), color = "darkgrey") +
scale_fill_viridis(discrete = T) +
labs(fill = "Income Context",
title = "Income Context of Philadelphia Neighborhoods",
subtitle = "2019",
caption = "Figure 13.") +
mapTheme() + theme(legend.position = "bottom")
Table 2 shows that both models overestimate permit counts for majority non-White neighborhoods, though the difference is relatively small for the model with the added spatial process features. Overall, the model with spatial features generalizes well with respect to race.
reg.summary %>%
filter(grepl("LOGO", Regression)) %>%
st_centroid() %>%
st_join(RaceContext) %>%
na.omit() %>%
st_drop_geometry() %>%
group_by(Regression, RaceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(RaceContext, mean.Error) %>%
kable(caption = "Table 2. Mean Error by Neighborhood Racial Context") %>%
kable_styling("striped", full_width = F) %>%
row_spec(2, bold = T, color = "white", background = viridis::cividis(1)) %>%
kable_classic(html_font = "Cambria")
| Regression | Majority non-White | Majority White |
|---|---|---|
| Spatial LOGO-CV: Predictors | 2.9338193 | 0.4763929 |
| Spatial LOGO-CV: Predictors with Spatial Process | 0.4727099 | 0.1580031 |
Similarly, we tested the generalizability of our model in an income context. Table 3 shows that the model with spatial features generalizes better than the model without, though both models overestimate permit counts for neighborhoods considered "Below Median Income".
reg.summary %>%
filter(grepl("LOGO", Regression)) %>%
st_centroid() %>%
st_join(IncomeContext) %>%
na.omit() %>%
st_drop_geometry() %>%
group_by(Regression, IncomeContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(IncomeContext, mean.Error) %>%
kable(caption = "Table 3. Mean Error by Neighborhood Income Context") %>%
kable_styling("striped", full_width = F) %>%
row_spec(2, bold = T, color = "white", background = viridis::cividis(1)) %>%
kable_classic(html_font = "Cambria")
| Regression | Above Median Income | Below Median Income |
|---|---|---|
| Spatial LOGO-CV: Predictors | 0.2946100 | 2.9462298 |
| Spatial LOGO-CV: Predictors with Spatial Process | 0.3275409 | 0.7222918 |
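One way to summarize Tables 2 and 3 is the gap in mean error between the two groups in each context; a smaller gap indicates better generalizability. Using the values reported above:

```r
# Gaps in mean error between groups, values copied from Tables 2 and 3
race_gap_predictors   <- 2.9338193 - 0.4763929  # ~2.46
race_gap_spatial      <- 0.4727099 - 0.1580031  # ~0.31
income_gap_predictors <- 2.9462298 - 0.2946100  # ~2.65
income_gap_spatial    <- 0.7222918 - 0.3275409  # ~0.39
```

In both contexts, adding the spatial process features shrinks the cross-group gap several-fold, supporting the claim that the spatial model generalizes better.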
While development is a constantly shifting process, it is possible to model it and develop an understanding of what is driving it within Philadelphia. GENTRISK does this by examining new construction permit data, a proxy for development, over the span from 2015 to 2019 (with 2019 serving as the prediction target). This time period enables an analysis that is robust yet free from the influence of the COVID-19 global pandemic. A number of factors, including the amount of owner-occupied versus renter-occupied housing, the presence of bike lanes, and complaints about issues such as dangerous sidewalks, were sourced, and their ability to predict the amount of development in a target year (2019, for the testing purposes of the model) was established. From this, the previous year's development, the XYZ, and XYZ emerged as the factors with the most explanatory power. After model results were obtained, checking the values using leave-one-group-out cross-validation demonstrated that accounting for spatial variables also improved model results and decreased margins of error across the board, particularly across racial and income groups.
The model at the core of GENTRISK has been developed, and additional capabilities will be added, including the ability to change inputs to the model, such as the tree layer, the development (permit count) layer, and other factors. This powerful capability will enable city council members to model the impact that processes such as a new development plan may have on their residents. Furthermore, the core model will be improved to minimize the influence that outsized permit counts have on its accuracy, and to provide more robust analytical power in low-income and majority-minority communities.